variant

您所在的位置:网站首页 samtools sort sam to bam variant

variant

#variant| 来源: 网络整理| 查看: 265

Alignment File Processing Learning objectives Differentiate between query-sorted and coordinate-sorted alignment files Describe and remove duplicate reads Process a raw SAM file for input into a BAM for GATK Alignment file processing with samtools and Picard

The processing of the alignment files (SAM/BAM files) can be done either with samtools or Picard and they are, for the most part, interchangable. The arguments for either are below:

samtools

The software we will be using to call variants (GATK) can be picky about how the files are formatted and so you need to be careful about the formatting as to not produce an error in GATK More user-friendly Multi-threaded

Picard

Maintained by the Broad Institute, so it has excellent integration with GATK (also maintained by the Broad Institute) Written in Java and the error messages are not as easily interpretable Single-threaded

If implemented correctly, they largely provide the same output. In this workshop, we will be using Picard, but the samtools code will be provided in a dropdown for each section if you would like to know how to do the step in samtools.

Pipeline for processing alignment file with Picard

Before we start processing our alignment SAM file provided by bwa, let's briefly discuss the steps that we will be doing in this pipeline. Several goals need to be accomplished:

Compress SAM file to BAM file. The output of bwa is a SAM file and it is human readbale. However, it is quite large and we need to compress it to a binary version (BAM) which is much smaller. Query-sort our alignment file. Alignment file are initally ordered by the order of the reads in the FASTQ file, which is not particularly useful. Picard can more exhaustively look for duplicates if the file is sorted by read-name (query-sorted). We will discuss query-sorted and coordinate-sorted alignment files soon. Mark and Remove Duplicates. Duplicates can introduce bias into our analysis so it is considered best practice to remove them prior to variant calling. Coordinate-sort our alignment file. Most downstream software packages require that alignment files be coordinate-sorted, so we will need to re-sort our alignment file by coordinates now that we have removed the duplicates. Index the alignment file. Like the index for a book, indicies for alignment files help direct downstream software packages to where to they can find specific reads. Many software packages require the alignment file that you are analyzing to have an index file, usually with the same name as your alignment file, with the additional .bai (BAM-index) extension or a .bai extension instead of .bam. Both Picard and samtools have a way of integrating this indexing as part of their sorting protocols and that is what we will be using. However, both packages have commands for indexing a BAM file independent of their sorting protocol (BuildBamIndex for Picard and index for samtools).

Below is a flow chart of the Picard pipeline that we will be using:

NOTE: Some pipelines will have you add read groups while procressing your alignment files. This may be necessary in those pipelines, because they either can't add the read groups within their alignment tool (a problem that is pretty rare now), they forgot to add their read groups or they didn't know that they could add read groups during the alignment step with their alignment tool. Regardless of the reason, you should be aware that it is not uncommon to see this step in other's pipelines as many programs including GATK require read groups in order to run. It doesn't matter too much when you add the read groups in these processing steps, but we would recommend doing it first. Also, a command one could use to add read group information, and the one we show in the dropdown below, is Picard's AddOrReplaceReadGroups, which has the added benefit of allowing you to also sort your alignment file (our first step anyways) in the same step as adding the read group information. The dropdown below discusses how to add or replace read groups within Picard.

Click here if you need to add or replace read groups using Picard In order to add or replace read groups, we are going to use Picard's AddOrReplaceReadGroups tool. First we would need to load the Picard module: # Load module module load picard/2.27.5

The general syntax for AddOrReplaceReadGroups is:

# Add or replace read group information java -jar $PICARD/picard.jar AddOrReplaceReadGroups \ --INPUT $SAM_FILE \ --OUTPUT $BAM_FILE \ --RGID $READ_GROUP_ID \ --RGLB $READ_GROUP_LIBRARY \ --RGPL $READ_GROUP_PLATFORM \ --RGPU $READ_GROUP_PLATFORM_UNIT \ --RGSM $READ_GROUP_SAMPLE java -jar $PICARD/picard.jar AddOrReplaceReadGroups This calls the AddOrReplaceReadGroups package within Picard --INPUT $SAM_FILEThis is your input file. It could be a BAM/SAM alignment file, but because we recommend doing this first if you need to do it, this would be a SAM file. You don't need to specifiy that it is a BAM/SAM file, Picard with figure that out from the provided extension. --OUTPUT $BAM_FILEThis would be your output file. It could be BAM/SAM, but you would mostly likely pick BAM because you'd like to save space on the cluster. You don't need to specifiy that it is a BAM/SAM file, Picard with figure that out from the provided extension. --RGID $READ_GROUP_IDThis is your read group ID and must be unique --RGLB $READ_GROUP_LIBRARYThis is your read group library --RGPL $READ_GROUP_PLATFORMThis is the platform used for the sequencing --RGPU $READ_GROUP_PLATFORM_UNITThis is the unit used to do the sequencing --RGSM $READ_GROUP_SAMPLEThis is the sample name that the sequencing was done on

We discussed the Read Group tags previously in the Sequence Alignment Theory and more information on them can be found here.

If you would also sort your SAM/BAM file at the same time, you just need to add the --SORT_ORDER option to your command. If you don't add it, it will leave your reads in the same order as they were provided. The main two sort orders to be aware of are query-sorted and coordinate-sorted. A full discussion of them can be found shortly below. If you wanted the output to be query-sorted, then you could use:

--SORT_ORDER queryname

Or if you wanted them to be coordinate-sorted than you could use:

--SORT_ORDER coordinate

NOTE: You may encounter a situation where your reads from a single sample were sequenced across different lanes/machines. As a result, each alignment will have different read group IDs, but the same sample ID (the SM tag in your SAM/BAM file). You will need to merge these files here before continuing. The dropdown menu below will detail how to do this.

Click here if you need to merge alignment files from the same sample You can merge alignment files with different read group IDs from the same sample in both Picard and samtools. In the dropdowns below we will outline each method: Click to see how to merge SAM/BAM files in Picard First, we need to load the Picard module: module load picard/2.27.5 We can define our variables as: INPUT_BAM_1=Read_group_1.bam INPUT_BAM_2=Read_group_2.bam MERGED_OUTPUT=Merged_output.bam Here is the command we would need to run to merge the SAM/BAM files: java -jar $PICARD/picard.jar MergeSamFiles \ --INPUT $INPUT_BAM_1 \ --INPUT $INPUT_BAM_2 \ --OUTPUT $MERGED_OUTPUT We can breakdown this command: java -jar $PICARD/picard.jar MergeSamFiles This calls the MergeSamFiles from within Picard --INPUT $INPUT_BAM_1 This is the first SAM/BAM file that we would like to merge. --INPUT $INPUT_BAM_2 This is the second SAM/BAM file that we would like to merge. We can continue to add --INPUT lines as needed. --OUTPUT $MERGED_OUTPUT This is the output merged SAM/BAM file Click to see how to merge SAM/BAM files in samtools First, we need to load the samtools module, which also requires gcc to be loaded: module load gcc/6.2.0 module load samtools/1.15.1 We can define our variables as: INPUT_BAM_1=Read_group_1.bam INPUT_BAM_2=Read_group_2.bam MERGED_OUTPUT=Merged_output.bam THREADS=8 Here is the command we would need to run to merge the SAM/BAM files: samtools merge \ -o $MERGED_OUTPUT \ $INPUT_BAM_1 \ $INPUT_BAM_2 \ --output-fmt BAM \ -@ $THREADS We can break down this command: samtools merge This calls the merge package within samtools. -o $MERGED_OUTPUT This is the merged output file. $INPUT_BAM_1 This is the first SAM/BAM file that we would like to merge. $INPUT_BAM_2 This is the second SAM/BAM file that we would like to merge. We can continue to add addiitonal input SAM/BAM files to this list as needed. --output-fmt BAM This specifies the output format as BAM. If for some reason you wanted a SAM output file then you would use --output-fmt SAM instead. -@ $THREADS This specifies the number of threads we want to use for this process. We are using 8 threads in this example, but this could be different depending on the parameters that you would like to use. Creating our sbatch script

Let's go ahead and start making a new sbatch within vim:

cd ~/variant_calling/scripts/ vim picard_alignment_processing_normal.sbatch

Start the sbatch script with our shebang line, description of the script and our sbatch directives.

#!/bin/bash # This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Picard # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 #SBATCH -c 1 #SBATCH --mem 32G #SBATCH -o picard_alignment_processing_normal_%j.out #SBATCH -e picard_alignment_processing_normal_%j.err

Load the Picard module:

# Load module module load picard/2.27.5

Note: Picard is one of the pieces of software that does NOT require gcc/6.2.0 to also be loaded

Next, let's define some variables that we will be using:

# Assign file paths to variables SAM_FILE=/n/scratch3/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/picard/syn3_normal/ SAMPLE_NAME=syn3_normal QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam` METRICS_FILE=${REPORTS_DIRECTORY}/${SAMPLE_NAME}.remove_duplicates_metrics.txt COORDINATE_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}coordinate_sorted.bam`

Make a directory to hold the Picard reports:

# Make reports directory mkdir -p $REPORTS_DIRECTORY Sorting and Removing Duplicates

In order to appropriately flag and remove duplicates, we first need to query sort our SAM file. Oftentimes, when people discuss sorted BAM/SAM files, they are refering to coordinate-sorted BAM/SAM files.

Query-sorted BAM/SAM files are sorted based upon their read names and order lexiographically Coordinate-sorted BAM/SAM files are sorted by their sequence name (chromosome/linkage group/scaffold) and position

Picard can mark and remove duplicates in either coordinate-sorted or query-sorted BAM/SAM files, however, if the alignments are query-sorted it can test secondary alignments for duplicates. A brief discussion of this nuance is discussed in the MarkDuplicates manual of Picard. As a result, we will first query-sort our SAM file and convert it to a BAM file:

Query-sort the Alignment File

With our reads aligned to the reference, we would see if we opened up the SAM file that the alignments are in the order they were processed by bwa and not in any particular order that would be useful for downstream analyses. So, we are going to sort them into order by query (read) name for the downstream MarkDuplicates tool. As a reminder, we are going to sort our BAM file by coordinates later in the processing and when people discuss a "sorted BAM/SAM" file they are usually referring to a BAM/SAM file that is coordinate-sorted.

Additionally, while SAM files are nice due to their human readability, they are typically quite large files and it is not an efficient use of space on the cluster. Fortunately, there is a binary compression version of SAM called BAM. While we sort the reads, we are going to use to convert our SAM file to a BAM file. We don't need to specify this SAM-to-BAM conversion explicitly, because Picard will make this change by interpretting the file extensions that we provide in the INPUT and OUTPUT file options.

# Query-sort alginment file and convert to BAM java -jar $PICARD/picard.jar SortSam \ --INPUT $SAM_FILE \ --OUTPUT $QUERY_SORTED_BAM_FILE \ --SORT_ORDER queryname

The components of this command are:

java -jar $PICARD/picard.jar SortSam Calls Picard's SortSam software package

--INPUT $SAM_FILE This is where we provide the SAM input file

--OUTPUT $QUERY_SORTED_BAM_FILE This is the BAM output file. Because the extension is .bam rather than .sam, Picard will recognize this and create the output as a BAM file rather than as a SAM file like the input we have provided it.

--SORT_ORDER queryname The method with which we would like the file to be sorted. The options here are either queryname or coordinate.

NOTE: The syntax that Picard uses is quite particular and the syntax shown in the documentation is not always consistent. There are two main ways for providing input for Picard:

The Traditional Syntax When providing inputs in this fashion you will provide the option followed immediately without any whitespace by an = sign followed once again immediately without whitespace by the argument for that option. For example, we could have written the SortSam command as:

# Query-sort alginment file and convert to BAM java -jar $PICARD/picard.jar SortSam \ INPUT=$SAM_FILE \ OUTPUT=$QUERY_SORTED_BAM_FILE \ SORT_ORDER=queryname

And this while work and produce a valid output, your error file might contain a warning that looks something like:

********* NOTE: Picard's command line syntax is changing. ********* ********* For more information, please see: ********* https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition) ********* ********* The command line looks like this in the new syntax: ********* ********* SortSam -INPUT /n/scratch3/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal.sam -OUTPUT /n/scratch3/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal.query_sorted.bam -SORT_ORDER queryname *********

It should also be noted that you can, as this is true for many software programs, provide a single/couple letter abbreviations for an option. For example, this command could have also been validly written as:

# Query-sort alginment file and convert to BAM java -jar $PICARD/picard.jar SortSam \ I=$SAM_FILE \ O=$QUERY_SORTED_BAM_FILE \ SO=queryname

However, we have elected to write out the full names of options wherever possible to increase readabilty of the code. This is entirely a preference issue and as you get more familiar with these commands, you may want to use thier abbreviations to reduce time typing and potential typos.

The New (Barclay) Syntax Like the warning in the error output above described, Picard migrated to a new syntax several years back and this is the syntax we are demonstrating in this workshop. It uses a double hyphen (--) followed by the long-form name for the option followed by whitespace followed by the argument for that option. This is what we used above:

# Query-sort alginment file and convert to BAM java -jar $PICARD/picard.jar SortSam \ --INPUT $SAM_FILE \ --OUTPUT $QUERY_SORTED_BAM_FILE \ --SORT_ORDER queryname

Similiarly to the Traditional syntax, you can also use the single/couple letter abbreviations for an option. However, if you do this, then you should only use a single hyphen (-). This practice of using double hypohens for the long-form name of an option and a single hyphen for the abbreviate form of an option is very common in MANY software packages and will often be annotated in manuals as:

-I, --INPUT This is the input file

The above command using abbreivations would look like:

# Query-sort alginment file and convert to BAM java -jar $PICARD/picard.jar SortSam \ -I $SAM_FILE \ -O $QUERY_SORTED_BAM_FILE \ -SO queryname

All four of these commands are equaly valid and all produce the same output. However, we think you should be aware of them as Picard's documentation will sometimes use the New (Barclay) syntax is one place and the Traditional syntax in another place, sometimes even on the same page! Additionally, you may come across both syntaxes when trying to troubleshoot any errors you encounter when using Picard and thus you should be familiar with this peculiarity of Picard. We recommend Using the New (Barclay) syntax as to reduce error messages in your error output files and because Picard has been trying to migrate towards it.

Mark and Remove Duplicates

An important step in processing a BAM file is to mark and remove PCR duplicates. These PCR duplicates can introduce artifacts because regions that have preferential PCR amplification could be over-represented. These reads are flagged by having identical mapping locations in the BAM file. Importantly, it is impossible to distinguish between PCR duplicates and identical fragments. However, one can reduce the latter by doing paired-end sequencing and providing appropriate amounts of input material.

Now we will add the command to our script that allows us to mark and remove duplicates in Picard:

# Mark and remove duplicates java -jar $PICARD/picard.jar MarkDuplicates \ --INPUT $QUERY_SORTED_BAM_FILE \ --OUTPUT $REMOVE_DUPLICATES_BAM_FILE \ --METRICS_FILE $METRICS_FILE \ --REMOVE_DUPLICATES true

The components of this command are:

java -jar $PICARD/picard.jar MarkDuplicates Calls Picard's MarkDuplicates program

--INPUT $QUERY_SORTED_BAM_FILE Uses our query-sorted BAM file as input

--OUTPUT $REMOVE_DUPLICATES_BAM_FILE Write the output to a BAM file

--METRICS_FILE $METRICS_FILE Creates a metrics file (required by Picard MarkDuplicates)

--REMOVE_DUPLICATES true Not only are we going to mark/flag our duplicates, we can also remove them. By setting the REMOVE_DUPLICATES parameter equal to true to can remove the duplicates.

Coordinate-sort the Alignment File

For most downstream processes, coordinate-sorted alignment files are required. As a result, we will need to change our alignment file from being query-sorted to being coordinate-sorted and we will once again use the SortSam command within Picard to accomplish this. Since this BAM file will be the final BAM file that we make and will use for downstream analyses, we will need to create an index for it at the same time. The command we will be using for coordinate-sorting and indexing our BAM file is:

# Coordinate-sort BAM file and create BAM index file java -jar $PICARD/picard.jar SortSam \ --INPUT $REMOVE_DUPLICATES_BAM_FILE \ --OUTPUT $COORDINATE_SORTED_BAM_FILE \ --SORT_ORDER coordinate \ --CREATE_INDEX true

The components of this command are:

java -jar $PICARD/picard.jar SortSam Calls Picard's SortSam program

--INPUT $REMOVE_DUPLICATES_BAM_FILE Our BAM file once we have removed the duplicate reads. NOTE: Even though the software is called SortSam, it can use BAM or SAM files as input and also BAM or SAM files as output.

--OUTPUT COORDINATE_SORTED_BAM_FILE Our BAM output file sorted by coordinates.

--SORT_ORDER coordinate Sort the output file by coordinates

--CREATE_INDEX true Setting the CREATE_INDEX equal to true will create an index of the final BAM output. The index creation can also be accomplished by using the BuildBamIndex command within Picard, but this CREATE_INDEX functionality is built into many Picard functions, so you can often use it at the last stage of processing your BAM file to save having to run BuildBamIndex after.

Your final sbatch script for Picard should look like:

#!/bin/bash # This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Picard # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 #SBATCH -c 1 #SBATCH --mem 32G #SBATCH -o picard_alignment_processing_normal_%j.out #SBATCH -e picard_alignment_processing_normal_%j.err # Load module module load picard/2.27.5 # Assign file paths to variables SAM_FILE=/n/scratch3/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/picard/syn3_normal/ SAMPLE_NAME=syn3_normal QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam` METRICS_FILE=${REPORTS_DIRECTORY}/${SAMPLE_NAME}.remove_duplicates_metrics.txt COORDINATE_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}coordinate_sorted.bam` # Make reports directory mkdir -p $REPORTS_DIRECTORY # Query-sort alginment file and convert to BAM java -jar $PICARD/picard.jar SortSam \ --INPUT $SAM_FILE \ --OUTPUT $QUERY_SORTED_BAM_FILE \ --SORT_ORDER queryname # Mark and remove duplicates java -jar $PICARD/picard.jar MarkDuplicates \ --INPUT $QUERY_SORTED_BAM_FILE \ --OUTPUT $REMOVE_DUPLICATES_BAM_FILE \ --METRICS_FILE $METRICS_FILE \ --REMOVE_DUPLICATES true # Coordinate-sort BAM file and create BAM index file java -jar $PICARD/picard.jar SortSam \ --INPUT $REMOVE_DUPLICATES_BAM_FILE \ --OUTPUT $COORDINATE_SORTED_BAM_FILE \ --SORT_ORDER coordinate \ --CREATE_INDEX true Click here for alignment file processing using Samtools Samtools is another popular tool used for processing BAM/SAM files. The output from Samtools compared to Picard is largely the same. Below is the pipeline and explanation for how you would carry out the similar SAM/BAM processing steps within Samtools.

Click here for setting up a sbatch script BAM/SAM Processing for the Samtools pipeline Setting up sbatch Script First, we are going to navigate to our scirpts folder and open a new sbatch submission script in vim: cd ~/variant_calling/scripts/ vim samtools_processing_normal.sbatch Next, we are going to need to set-up our sbatch submission script with our shebang line, description, sbatch directives, modules to load and file variables. #!/bin/bash # This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Samtools # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 #SBATCH -c 8 #SBATCH --mem 16G #SBATCH -o samtools_processing_normal_%j.out #SBATCH -e samtools_processing_normal_%j.err # Load modules module load gcc/6.2.0 module load samtools/1.15.1 # Assign file paths to variables SAM_FILE=/n/scratch3/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` FIXMATE_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}fixmates.bam` COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordinate_sorted.bam` REMOVED_DUPLICATES_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}removed_duplicates.bam` Click here for query-sorting a SAM file and converting it to BAM for the Samtools pipeline Similarly to Picard, we are going to need to initally query-sort our alignment. We are also going to be converting the SAM file into a BAM file at this step. Also similarly to Picard, we don't need to specify that our input files are BAM or SAM files. Samtools will use the extensions you provide it in your file names as guidance for whether you are providing it a BAM/SAM file. Below is the code we will use to query-sort our SAM file and convert it into a BAM file: # Sort SAM file and convert it to a query name sorted BAM file samtools sort \ -@ 8 \ -n \ -o $QUERY_SORTED_BAM_FILE \ $SAM_FILE

The components of this line of code are:

samtools sort This calls the sort function within samtools. -@ 8 This tells samtools to use 8 threads when it multithreads this task. Since we requested 8 cores for this sbatch submission, let's go ahead and use them all. -n This argument tells samtools sort to sort by read name as opposed the the default sorting which is done by coordinate. -O bam This is declaring the output format of .bam. -o $QUERY_SORTED_BAM_FILE This is a bash variable that holds the path to the output file of the samtools sort command. $SAM_FILE This is a bash variable holding the path to the input SAM file. Click here for fixing mate information for the Samtools pipeline Next, we are going to add more mate-pair information to the alignments including the insert size and mate pair coordinates. It is important to note that for this command samtools relies on positional parameters for assigning the input and output BAM files. In this case the input BAM file ($QUERY_SORTED_BAM_FILE) needs to come before the output file ($FIXMATE_BAM_FILE): # Score mates samtools fixmate \ -@ 8 \ -m \ $QUERY_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE

The parts of this command are:

samtools fixmate This calls the fixmate command in samtools -@ 8 This tells samtools to use 8 threads when it multithreads this task. -m This will add the mate score tag that will be critically important later for samtools markdup $QUERY_SORTED_BAM_FILE This is our input file $FIXMATE_BAM_FILE This is our output file Click here for coordinate-sorting a BAM file for the Samtools pipeline

Now that we have added the fixmate information, we need to coordinate-sort the BAM file. We can do that by:

# Sort BAM file by coordinate samtools sort \ -@ 8 \ -o $COORDINATE_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE

We have gone through all of the these paramters already in the previous samtools sort command. The only difference in this command is that we are not using the -n option, which tells samtools to sort by read name. Now, we are excluding this and thus sorting by coordinates, the default setting.

Click here for marking and removing duplicates for the Samtools pipeline

Now we are going to mark and remove the duplicate reads:

# Mark and remove duplicates and then index the output file samtools markdup \ -r \ --write-index \ -@ 8 \ $COORDINATE_SORTED_BAM_FILE \ ${REMOVED_DUPLICATES_BAM_FILE}##idx##${REMOVED_DUPLICATES_BAM_FILE}.bai

The components of this command are:

samtools markdup This calls the mark duplicates software in samtools -r This removes the duplicate reads --write-index This writes an index file of the output -@ 8 This sets that we will be using 8 threads $BAM_FILE This is our BAM input file ${REMOVED_DUPLICATES_BAM_FILE}##idx##${REMOVED_DUPLICATES_BAM_FILE}.baiThis has two parts: The first part (${REMOVED_DUPLICATES_BAM_FILE}) is our BAM output file with the duplicates removed from it The second part (##idx##${REMOVED_DUPLICATES_BAM_FILE}.bai) is a shortcut to creating a .bai index of the BAM file. If we use the --write-index option without this second part, it will create a .csi index file. .bai index files are a specific type of .csi files, so we need to specify it with the second part of this command to ensure that a .bai index file is created rather than a .csi index file. Click here for the final normal sample sbatch script to do the BAM/SAM processing for the Samtools pipeline

The final script should look like:

#!/bin/bash # This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Samtools # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 #SBATCH -c 8 #SBATCH --mem 16G #SBATCH -o samtools_processing_normal_%j.out #SBATCH -e samtools_processing_normal_%j.err # Load modules module load gcc/6.2.0 module load samtools/1.15.1 # Assign file paths to variables SAM_FILE=/n/scratch3/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` FIXMATE_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}fixmates.bam` COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordinate_sorted.bam` REMOVED_DUPLICATES_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}removed_duplicates.bam` # Sort SAM file and convert it to a query name sorted BAM file samtools sort \ -@ 8 \ -n \ -o $QUERY_SORTED_BAM_FILE \ $SAM_FILE # Score mates samtools fixmate \ -@ 8 \ -m \ $QUERY_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE # Sort BAM file by coordinate samtools sort \ -@ 8 \ -o $COORDINATE_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE # Mark and remove duplicates and then index the output file samtools markdup \ -r \ --write-index \ -@ 8 \ $COORDINATE_SORTED_BAM_FILE \ ${REMOVED_DUPLICATES_BAM_FILE}##idx##${REMOVED_DUPLICATES_BAM_FILE}.bai Click here for the final tumor sample sbatch script to do the BAM/SAM processing for the samtools pipeline In order to create the tumor sbatch submission script to process the BAM/SAM file using samtools, we will once again use sed: sed 's/normal/tumor/g' samtools_processing_normal.sbatch > samtools_processing_tumor.sbatch The final sbatch submission script for the tumor sample should look like: #!/bin/bash # This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Samtools # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 #SBATCH -c 8 #SBATCH --mem 16G #SBATCH -o samtools_processing_tumor_%j.out #SBATCH -e samtools_processing_tumor_%j.err # Load modules module load gcc/6.2.0 module load samtools/1.15.1 # Assign file paths to variables SAM_FILE=/n/scratch3/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.sam QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` FIXMATE_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}fixmates.bam` COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordinate_sorted.bam` REMOVED_DUPLICATES_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}removed_duplicates.bam` # Sort SAM file and convert it to a query name sorted BAM file samtools sort \ -@ 8 \ -n \ -o $QUERY_SORTED_BAM_FILE \ $SAM_FILE # Score mates samtools fixmate \ -@ 8 \ -m \ $QUERY_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE # Sort BAM file by coordinate samtools sort \ -@ 8 \ -o $COORDINATE_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE # Mark and remove duplicates and then index the output file samtools markdup \ -r \ --write-index \ -@ 8 \ $COORDINATE_SORTED_BAM_FILE \ ${REMOVED_DUPLICATES_BAM_FILE}##idx##${REMOVED_DUPLICATES_BAM_FILE}.bai Click here for information on indexing a BAM file Many software packages want an index of your BAM file in order to facilitate fast look-ups of a BAM file. While not all software packages that use a BAM file will require this, many will and thus it is a good practice to index your BAM file while processing it. In our previous line of Picard command, we provided it with the CREATE_INDEX=true option, so it automatically created an index for us after coordinate-sorting our BAM file. If for some reason we needed to create an BAM-index for a coordinate-sorted BAM file, we can do this in Picard or samtools. BAM-Indexing within Picard The code for indexing a BAM file in Picard would look like: # SKIP THIS STEP # Index the BAM file java -jar $PICARD/picard.jar BuildBamIndex \ --INPUT $BAM_FILE

The components of this command are:

java -jar $PICARD/picard.jar BuildBamIndex This calls the BuildBamIndex tools within Picard --INPUT $BAM_FILE This is the BAM file that you wish to index. BAM-Indexing within samtools The command to index a BAM file with samtools would be: #### SKIP THIS STEP # Index the BAM file samtools index \ $BAM_FILE

The components of this command are:

samtools index Calls the index function within samtools $BAM_FILE This is a bash variable that holds the path to the BAM file that we want to index.

We don't need to provide an output file for samtools index, by default it will generate a new file using the same path and filename as the BAM file, but add .bai as the extension instead of .bam to denote that it is a BAM-index file.

NOTE: BAM indexes can only be made from coordinate-sorted BAM files. Exercises

1. When inspecting a SAM file you see the following order:

Is this SAM file's sort order likely: unsorted, query-sorted, coordinate-sorted or is it ambiguous?

2. We are comparing our SortSam command with our colleague's command. Is there anything wrong with their syntax? Why or why not?

Our syntax

java -jar $PICARD/picard.jar SortSam \ --INPUT $SAM_FILE \ --OUTPUT $QUERY_SORTED_BAM_FILE \ --SORT_ORDER queryname

Our colleague's syntax

java -jar $PICARD/picard.jar SortSam \ I=$SAM_FILE \ O=$QUERY_SORTED_BAM_FILE \ SO=queryname Creating the Tumor SAM/BAM procressing

Similarly to the bwa script, we will now need use sed to create a sbatch script that will be used for processing the tumor SAM file into a BAM file that can be used as input to GATK. The sed command to do this would be:

sed 's/normal/tumor/g' picard_alignment_processing_normal.sbatch > picard_alignment_processing_tumor.sbatch

As a result your tumor Picard alignment processing script should look like:

#!/bin/bash # This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Picard # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 #SBATCH -c 1 #SBATCH --mem 32G #SBATCH -o picard_alignment_processing_tumor_%j.out #SBATCH -e picard_alignment_processing_tumor_%j.err # Load module module load picard/2.27.5 # Assign file paths to variables SAM_FILE=/n/scratch3/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.sam REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/picard/syn3_tumor/ SAMPLE_NAME=syn3_tumor QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam` METRICS_FILE=${REPORTS_DIRECTORY}/${SAMPLE_NAME}.remove_duplicates_metrics.txt COORDINATE_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}coordinate_sorted.bam` # Make reports directory mkdir -p $REPORTS_DIRECTORY # Query-sort alginment file and convert to BAM java -jar $PICARD/picard.jar SortSam \ --INPUT $SAM_FILE \ --OUTPUT $QUERY_SORTED_BAM_FILE \ --SORT_ORDER queryname # Mark and remove duplicates java -jar $PICARD/picard.jar MarkDuplicates \ --INPUT $QUERY_SORTED_BAM_FILE \ --OUTPUT $REMOVE_DUPLICATES_BAM_FILE \ --METRICS_FILE $METRICS_FILE \ --REMOVE_DUPLICATES true # Coordinate-sort BAM file and create BAM index file java -jar $PICARD/picard.jar SortSam \ --INPUT $REMOVE_DUPLICATES_BAM_FILE \ --OUTPUT $COORDINATE_SORTED_BAM_FILE \ --SORT_ORDER coordinate \ --CREATE_INDEX true Submitting Picard processing

Now we are ready to submit our normal and tumor Picard processing scripts to the O2 cluster. However, we might have a problem. If you managed to go quickly into this lesson from the previous lesson, your bwa alignment scripts may still be running and your SAM files are not complete yet!

First, we need to check the status of our bwa scripts and we can do this with the command:

squeue -u $USER

If you have bwa jobs still running, then wait for them to complete (less than 2 hours) before continuing. There are ways to queue jobs together in SLURM using the --dependency option in sbatch. We will go over this in the automation lesson, but for now just hang tight until your jobs have finished.

If the only job running is your interactive job, then it should be time to start your Picard processing scripts. You can go ahead and submit your sbatch scripts for Picard processing with:

sbatch picard_alignment_processing_normal.sbatch sbatch picard_alignment_processing_tumor.sbatch

Next Lesson >>

Back to Schedule

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



【本文地址】


今日新闻


推荐新闻


CopyRight 2018-2019 办公设备维修网 版权所有 豫ICP备15022753号-3